Joins on Encoded and Partitioned Data

Authors

Jae-Gil Lee

Gopi K. Attaluri

Ronald Barber

Naresh Chainani

Oliver Draese

Frederick Ho

Stratos Idreos

Min-Soo Kim

Sam Lightstone

Guy M. Lohman

Konstantinos Morfonios

Keshava Murthy

Ippokratis Pandis

Lin Qiao

Vijayshankar Raman

Vincent KulandaiSamy

Richard Sidle

Knut Stolze

Liping Zhang

Abstract

Compression has historically been used to reduce the cost of storage, I/Os from that storage, and buffer pool utilization, at the expense of the CPU required to decompress data every time it is queried. However, significant additional CPU efficiencies can be achieved by deferring decompression as late in query processing as possible and performing query processing operations directly on the still-compressed data. In this paper, we investigate the benefits and challenges of performing joins on compressed (or encoded) data. We demonstrate the benefit of independently optimizing the compression scheme of each join column, even though join predicates relating values from multiple columns may require translation of the encoding of one join column into the encoding of the other. We also show the benefit of compressing “payload” data other than the join columns “on the fly,” to minimize the size of hash tables used in the join. By partitioning the domain of each column and defining separate dictionaries for each partition, we can achieve even better overall compression as well as increased flexibility in dealing with new values introduced by updates. Instead of decompressing both join columns participating in a join to resolve their different compression schemes, our system performs a light-weight mapping of only qualifying rows from one of the join columns to the encoding space of the other at run time. Consequently, join predicates can be applied directly on the compressed data. We call this procedure encoding translation. Two alternatives of encoding translation are developed and compared in the paper. We provide a comprehensive evaluation of these alternatives using product implementations of each on the TPC-H data set, and demonstrate that performing joins on encoded and partitioned data achieves both superior performance and excellent compression.
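The idea of encoding translation described in the abstract can be illustrated with a minimal sketch. It assumes simple per-column dictionary encoding with dense integer codes and a single dictionary per column (no partitioning); the function names (`build_dictionary`, `build_translation`, `join_on_codes`) and the sample data are hypothetical and are not taken from the paper, which develops two translation alternatives that the abstract does not detail.

```python
from typing import Dict, List, Optional, Tuple


def build_dictionary(values: List[str]) -> Dict[str, int]:
    """Assign dense integer codes to the distinct values, in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values: List[str], dictionary: Dict[str, int]) -> List[int]:
    """Replace every value with its code from the column's own dictionary."""
    return [dictionary[value] for value in values]


def build_translation(
    src_dict: Dict[str, int], dst_dict: Dict[str, int]
) -> Dict[int, Optional[int]]:
    """Map each source code to the destination code of the same value.

    A value that does not occur in the destination dictionary maps to None,
    so rows carrying it can be dropped without decoding anything.
    """
    return {code: dst_dict.get(value) for value, code in src_dict.items()}


def join_on_codes(
    probe_codes: List[int],
    build_codes: List[int],
    translation: Dict[int, Optional[int]],
) -> List[Tuple[int, int]]:
    """Hash join evaluated entirely on codes; returns (probe_row, build_row)."""
    table: Dict[int, List[int]] = {}
    for row, code in enumerate(build_codes):
        table.setdefault(code, []).append(row)

    matches: List[Tuple[int, int]] = []
    for row, code in enumerate(probe_codes):
        translated = translation.get(code)
        if translated is None:
            continue  # value has no counterpart in the other column
        for build_row in table.get(translated, []):
            matches.append((row, build_row))
    return matches


# Hypothetical sample columns, each encoded with its own dictionary.
orders_cust = ["C3", "C1", "C3", "C9"]
customer_key = ["C1", "C2", "C3"]

orders_dict = build_dictionary(orders_cust)
customer_dict = build_dictionary(customer_key)
translation = build_translation(orders_dict, customer_dict)

result = join_on_codes(
    encode(orders_cust, orders_dict),
    encode(customer_key, customer_dict),
    translation,
)
print(result)
```

In this sketch the translation table is built once per pair of dictionaries, so the per-row cost at join time is a single lookup, and neither column is ever decoded back to its original values.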

Similar resources

Memory-Efficient Hash Joins

We present new hash tables for joins, and a hash join based on them, that consumes far less memory and is usually faster than recently published in-memory joins. Our hash join is not restricted to outer tables that fit wholly in memory. Key to this hash join is a new concise hash table (CHT), a linear probing hash table that has 100% fill factor, and uses a sparse bitmap with embedded populatio...

An Evaluation of Non-Equijoin Algorithms

A non-equijoin of relations R and S is a band join if the join predicate requires values in the join attribute of R to fall within a specified band about the values in the join attribute of S. We propose a new algorithm, termed a partitioned band join, for evaluating band joins. We present a comparison between the partitioned band join algorithm and the classical sort-merge join algorithm (optim...

Scalable and Efficient Self-Join Processing technique in RDF data

Efficient management of RDF data plays an important role in successfully understanding and fast querying data. Although the current approaches of indexing in RDF Triples such as property tables and vertically partitioned solved many issues; however, they still suffer from the performance in the complex self-join queries and insert data in the same table. As an improvement in this paper, we prop...

BlockJoin: Efficient Matrix Partitioning Through Joins

Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to-end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user defined functions used for feature extraction and vectorization, and (iii)...

Executing Web Application Queries on a Partitioned Database

Partitioning data over multiple storage servers is an attractive way to increase throughput for web-like workloads. However, there is often no one partitioning that yields good performance for all queries, and it can be challenging for the web developer to determine how best to execute queries over partitioned data. This paper presents DIXIE, a SQL query planner, optimizer, and executor for dat...

Journal:

PVLDB

Volume 7

Publication date: 2014
